Discretization of Numerical Attributes: Preprocessing for Machine Learning
Abstract
The area of knowledge discovery and data mining is growing rapidly, and a large number of methods are employed to mine knowledge. Several of these methods rely on discrete data. However, most datasets used in real applications have attributes with continuous values. To make data mining techniques useful for such datasets, discretization is performed as a preprocessing step. In this paper we examine a few common methods for discretization and test these algorithms on common datasets. We also propose a method for reducing the number of intervals resulting from an orthogonal discretization by compromising the consistency level of a decision system. The algorithms have been evaluated using a rough set toolkit for data analysis.

In Chapter 2, we introduce knowledge discovery, discuss the preprocessing step in the knowledge discovery pipeline, and introduce discretization in particular. Chapter 3 introduces some basic notions of rough set theory. In Chapter 4, we further discuss the discretization process and investigate some common methods for discretization. In Chapter 5, we propose a two-step approach to discretization, using the naive discretization algorithm introduced in Section 4.2 together with a proposed algorithm for merging intervals. Empirical results from a comparison of the algorithms discussed in Chapter 4 and the proposed method from Chapter 5 can be found in Chapter 6. In Chapter 7, further work in the area of discretization is discussed. Appendix A contains some notes on the Rosetta [15] framework, and on the implementation of the investigated algorithms in particular.
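A minimal Python sketch of the two-step idea described above, under stated assumptions: a naive discretizer that places a cut between every pair of adjacent values belonging to different decision classes, followed by a greedy merge that drops cuts as long as the consistency level (taken here as the fraction of objects lying in single-class intervals) stays above a threshold. All function names and the exact consistency measure are illustrative, not taken from the paper.

```python
import bisect

def naive_cuts(values, classes):
    """Place a cut midway between adjacent values whose classes differ."""
    pairs = sorted(zip(values, classes))
    cuts = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if v1 != v2 and c1 != c2:
            cuts.append((v1 + v2) / 2.0)
    return cuts

def consistency(values, classes, cuts):
    """Fraction of objects that fall into single-class (pure) intervals."""
    interval = lambda v: bisect.bisect_left(cuts, v)
    members = {}
    for v, c in zip(values, classes):
        members.setdefault(interval(v), set()).add(c)
    pure = {i for i, cs in members.items() if len(cs) == 1}
    return sum(interval(v) in pure for v in values) / len(values)

def merge_intervals(values, classes, cuts, threshold=0.95):
    """Greedily drop cut points while consistency stays above the threshold."""
    kept = list(cuts)
    for cut in list(kept):
        trial = [c for c in kept if c != cut]
        if consistency(values, classes, trial) >= threshold:
            kept = trial
    return kept
```

Lowering the threshold below 1.0 trades away some consistency of the decision system in exchange for fewer intervals, which is the compromise the abstract refers to.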
Similar Resources
Empirical comparisons of various discretization procedures
The genuine symbolic machine learning (ML) algorithms are capable of processing symbolic, categorical data only. However, real-world problems, e.g. in medicine or finance, involve both symbolic and numerical attributes. Therefore, discretizing (categorizing) numerical attributes is an important issue in ML. There exist quite a few discretization procedures in the ML field. This paper describes ...
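As an illustration of two of the many procedures this kind of comparison covers, the following sketch implements equal-width and equal-frequency binning, two classic unsupervised discretizers; the procedures compared in the paper itself may differ.

```python
def equal_width(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0          # guard against a constant attribute
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency(values, k):
    """Split so that each interval holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // len(values), k - 1)
    return bins
```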
Discretization and Grouping: Preprocessing Steps for Data Mining
Unlike on-line discretization performed by a number of machine learning (ML) algorithms for building decision trees or decision rules, we propose off-line algorithms for discretizing numerical attributes and grouping values of nominal attributes. The number of resulting intervals obtained by discretization depends only on the data; the number of groups corresponds to the number of classes. Sinc...
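A minimal sketch of the grouping idea above, assuming a simple majority-class criterion (the paper's own criterion is not reproduced here): each value of a nominal attribute is mapped to the decision class it most often co-occurs with, so the number of groups is bounded by the number of classes.

```python
from collections import Counter

def group_nominal(values, classes):
    """Map each nominal value to its majority decision class."""
    counts = {}
    for v, c in zip(values, classes):
        counts.setdefault(v, Counter())[c] += 1
    return {v: cnt.most_common(1)[0][0] for v, cnt in counts.items()}

# Example: attribute values grouped by the class they predict best.
grouping = group_nominal(
    ["red", "red", "blue", "green", "blue"],
    ["yes", "yes", "no", "yes", "no"])
# {'red': 'yes', 'blue': 'no', 'green': 'yes'}
```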
Global discretization of continuous attributes as preprocessing for machine learning
Real-life data are usually represented in databases by real numbers. On the other hand, most inductive learning methods require a small number of attribute values. Thus it is necessary to convert input data sets with continuous attributes into input data sets with discrete attributes. Methods of discretization restricted to single continuous attributes will be called local, while methods that sim...
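To make the local/global distinction concrete, here is a hedged sketch of a purely local scheme that discretizes each attribute in isolation; a global method would instead choose cuts while considering all attributes simultaneously. The discretizer argument stands for any single-attribute procedure, such as the equal-width binning sketched earlier.

```python
def discretize_locally(rows, k, discretizer):
    """Apply a single-attribute discretizer to every column independently."""
    columns = list(zip(*rows))                    # attribute-major view
    binned = [discretizer(list(col), k) for col in columns]
    return [list(row) for row in zip(*binned)]    # back to object-major rows
```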
Optimized Preprocessing for Accurate and Efficient Bioassay Prediction with Machine Learning Algorithms
Bioassay is the measurement of the potency of a chemical substance by its effect on a living animal or plant tissue. Bioassay data and chemical structures from pharmacokinetic and drug metabolism screening are mined from and housed in multiple databases. Bioassay prediction is calculated accordingly to determine further advancement. This paper proposes a four-step preprocessing of datasets for ...
Discretization of Continuous Attributes in Supervised Learning algorithms
We propose a new algorithm, called CILA, for the discretization of continuous attributes. The CILA algorithm can be used with any class-labeled data. The tests performed using the CILA algorithm show that it generates discretization schemes with almost always the highest dependence between the class labels and the discrete intervals, and always with a significantly lower number of intervals, when comp...
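The quality criterion described above, dependence between class labels and discrete intervals, can be quantified in several ways; the sketch below uses mutual information over the interval/class contingency table as one such measure. This is an assumed illustration of the criterion only, not the CILA algorithm itself.

```python
from collections import Counter
from math import log

def interval_class_dependence(intervals, classes):
    """Mutual information between interval ids and class labels (in nats)."""
    n = len(intervals)
    joint = Counter(zip(intervals, classes))   # co-occurrence counts
    p_i = Counter(intervals)                   # marginal interval counts
    p_c = Counter(classes)                     # marginal class counts
    return sum(nic / n * log(nic * n / (p_i[i] * p_c[c]))
               for (i, c), nic in joint.items())
```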
Published: 2007